This paper investigates the estimation of the underlying articulatory targets of Thai vowels as invariant representations of vocal tract shapes by means of analysis-by-synthesis based on acoustic data. The basic idea is to simulate the process of learning speech production as a distal learning task, with acoustic signals of natural utterances in the form of Mel-frequency cepstral coefficients (MFCCs) as input, VocalTractLab, a 3D articulatory synthesizer controlled by target approximation models, as the learner, and stochastic gradient descent as the target training method. To test the effectiveness of this approach, a speech corpus was designed to contain contextual variations of Thai vowels by juxtaposing nine Thai long vowels in two-syllable sequences; the corpus, consisting of 81 disyllabic utterances, was recorded from a native Thai speaker. Nine vocal tract shapes, each corresponding to one vowel, were estimated by optimizing the vocal tract shape parameters of each vowel to minimize the sum of squared errors between the MFCCs of the original and synthesized speech, with stochastic gradient descent used to iteratively update the shape parameters. The optimized vocal tract shapes were then used to synthesize Thai vowels both in monosyllables and in disyllabic sequences. The results, both numerical and perceptual, indicate that this model-based analysis strategy allows the vocal tract shapes to be estimated effectively and economically, yielding accurate Thai vowels as well as smooth formant transitions between adjacent vowels.
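The core optimization loop described above can be sketched as follows. This is a minimal illustration only, not the paper's implementation: `synthesize_mfcc` is a hypothetical stand-in for the full VocalTractLab-plus-MFCC pipeline, and for simplicity the sketch uses full-batch gradient descent with finite-difference gradients rather than the stochastic variant used in the study.

```python
import numpy as np

def sum_squared_error(pred, target):
    # The paper's loss: sum of squared errors over MFCC coefficients.
    return float(np.sum((pred - target) ** 2))

def estimate_shape(synthesize_mfcc, target_mfcc, init_params,
                   lr=0.01, eps=1e-4, n_iters=100):
    """Iteratively optimize vocal tract shape parameters to minimize the
    squared MFCC error between synthesized and recorded speech.

    synthesize_mfcc : hypothetical callable mapping shape parameters to MFCCs
    target_mfcc     : MFCCs extracted from the natural utterance
    """
    params = np.asarray(init_params, dtype=float).copy()
    for _ in range(n_iters):
        # Approximate the gradient by perturbing each parameter in turn
        # (the real system could use any gradient estimate the synthesizer allows).
        base = sum_squared_error(synthesize_mfcc(params), target_mfcc)
        grad = np.zeros_like(params)
        for i in range(len(params)):
            perturbed = params.copy()
            perturbed[i] += eps
            grad[i] = (sum_squared_error(synthesize_mfcc(perturbed),
                                         target_mfcc) - base) / eps
        params -= lr * grad  # gradient descent step
    return params
```

In the study, one such optimization is run per vowel, yielding nine vocal tract shapes that are then reused to synthesize both monosyllabic and disyllabic sequences.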